fix: cache worker IPC state for prealloc thread by shipiyouniao · Pull Request #336 · ovg-project/kvcached

shipiyouniao · 2026-05-19T13:06:53Z

Summary

Fix single-GPU startup hangs by caching the worker-IPC decision for the C++ background prealloc thread instead of calling back into Python from that thread.

Related issue: #334

Root cause

On world_size=1, the background prealloc thread enters PageAllocator::map_pages() during startup and evaluates should_use_worker_ipc().

That check was implemented as a Python callback. The prealloc thread therefore tried to re-enter Python while the main Python thread was still inside extension code and holding the GIL during startup. The background thread stalled before the first KV page mapping call, preallocation never completed, and the server never became ready.

The previous workaround avoided the hang by skipping the background prealloc thread for world_size=1, but it did not fix the underlying cause.

Changes

Cache the should_use_worker_ipc() result when the callback is installed from the Python-owned thread.
Make the C++ prealloc thread read that cached value instead of calling back into Python.
Keep the existing callback path for non-prealloc-thread callers.
Remove the Python-side single-GPU skip workaround so world_size=1 follows the normal prealloc flow again.

Why this approach

This keeps the behavior change narrowly scoped to the background prealloc thread, which is the only place that needs to avoid Python callback re-entry during startup. It restores the intended preallocation behavior on single-GPU startup instead of bypassing it.

Validation

Validated against the single-GPU reproduction described in issue #334:

vLLM 0.18.1
world_size=1
kvcached enabled
async scheduling enabled
eager mode enabled

Observed after the fix:

Background prealloc runs on single-GPU startup
Log shows Mapped 3 pages to KV tensors
Log shows Preallocated 3 pages, reserved=3
GET /health returns 200
GET /v1/models returns 200
Real POST /v1/chat/completions returns 200

Notes

This replaces the earlier workaround commit that skipped background preallocation for world_size=1. The new change fixes the verified deadlock mechanism directly.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Prevents a known deadlock during single-GPU startup by conditionally disabling the C++ background preallocation thread when world_size == 1.

Changes:

Gate start_prealloc_thread() behind world_size > 1
Add inline explanation describing the deadlock scenario
Add an info log when preallocation is skipped for single-GPU runs

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Avoid calling the Python worker IPC callback from the C++ prealloc thread. On single-GPU startup that callback can block on the GIL before the first page mapping call, which hangs background preallocation and keeps the server from becoming ready.

Copilot AI review requested due to automatic review settings May 19, 2026 13:06

Copilot AI reviewed May 19, 2026

View reviewed changes

Comment thread kvcached/kv_cache_manager.py Outdated

Comment thread kvcached/kv_cache_manager.py Outdated

Comment thread kvcached/kv_cache_manager.py Outdated

shipiyouniao force-pushed the fix/single-gpu-startup-hang branch from bcf4d13 to 10682ec Compare May 19, 2026 13:19

shipiyouniao requested a review from Copilot May 19, 2026 13:23

Copilot AI reviewed May 19, 2026

View reviewed changes

Comment thread kvcached/kv_cache_manager.py Outdated

Comment thread kvcached/kv_cache_manager.py Outdated

shipiyouniao force-pushed the fix/single-gpu-startup-hang branch from 10682ec to a48f046 Compare May 20, 2026 03:42

shipiyouniao changed the title ~~fix: skip background prealloc for world_size=1~~ fix: cache worker IPC state for prealloc thread May 20, 2026

shipiyouniao force-pushed the fix/single-gpu-startup-hang branch 2 times, most recently from a481bb1 to 7de21c7 Compare May 20, 2026 03:55

shipiyouniao requested a review from Copilot May 20, 2026 03:56

Copilot AI reviewed May 20, 2026

View reviewed changes

Comment thread csrc/page_allocator.cpp Outdated

Comment thread csrc/page_allocator.cpp Outdated

Comment thread csrc/page_allocator.cpp Outdated

shipiyouniao force-pushed the fix/single-gpu-startup-hang branch from 7de21c7 to f51e3ab Compare May 20, 2026 05:29

shipiyouniao requested a review from Copilot May 20, 2026 05:32

Copilot AI reviewed May 20, 2026

View reviewed changes

Comment thread csrc/page_allocator.cpp

Comment thread csrc/page_allocator.cpp Outdated

Comment thread csrc/page_allocator.cpp Outdated

Comment thread csrc/page_allocator.cpp

shipiyouniao force-pushed the fix/single-gpu-startup-hang branch from f51e3ab to 6f20be5 Compare May 20, 2026 05:38

shipiyouniao mentioned this pull request May 21, 2026

TP2 vLLM startup requires two kvcached fixes before the server becomes ready #339

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: cache worker IPC state for prealloc thread#336

fix: cache worker IPC state for prealloc thread#336
shipiyouniao wants to merge 1 commit into
ovg-project:mainfrom
shipiyouniao:fix/single-gpu-startup-hang

shipiyouniao commented May 19, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

shipiyouniao commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

Changes

Why this approach

Validation

Notes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

shipiyouniao commented May 19, 2026 •

edited

Loading